Abstract

We present a new class of metrics for unrooted phylogenetic X-trees
derived from the Gromov-Hausdorff distance for (compact) metric
spaces. These metrics can be computed efficiently by linear or
quadratic programming. They are also robust under NNI operations.
The local behaviour of the metrics shows that they differ from
all previously introduced metrics. The performance of the metrics is
briefly analysed on random weighted and unweighted trees as well as
random caterpillars.


1 Introduction 

The idea for this paper came from a talk by Michelle Kendall at the
Portobello conference 2015, see 12 ! . Basically, she postulated that the biological
information is essentially encoded in the collection of distances between the
MRCA (most recent common ancestor) of two taxa and the root. If the trees were
ultrametric, we could equivalently just collect the distances between all pairs
of taxa. That leads to our rationale:

Instead of trees we compare the induced metric spaces. 

This approach is feasible since by the work of Buneman SHE], see also [34] for 
the unweighted case, we can identify tree-induced metrics among all metrics 
by the famous four point conditions. 

In fact, this rationale must already have been behind the invention of the $\ell^1$ and
$\ell^2$ path difference distances [33, 27]. Below we also introduce an $\ell^\infty$
version of these metrics.

For (compact) metric spaces there is the well-known Gromov-Hausdorff
distance

$$D_{GH}((X,d),(X',d')) = \inf_{\varphi,\psi} \rho_H(\varphi(X),\psi(X')) \tag{1}$$

where the infimum is taken over all isometric embeddings $\varphi$, $\psi$ of $X$, $X'$ into a
common metric space, and $\rho_H$ is the Hausdorff metric on the compact subsets of
that space.

By our rationale, this definition induces a metric on the space of all
weighted trees. However, it cannot distinguish trees which yield isometric
metric spaces, i.e. trees with permuted labels. Since our aim is to compare trees
with the same taxon set, we have to adapt the metric (1) to our situation.
That makes the definition more complicated (see section 2) since we have
to match the leaf labels, but the idea of embeddings remains. Fortunately,
it is only in this way that our metric becomes efficiently computable. We simply have to
substitute the Hausdorff metric in (1). Since there are several reasonable
candidates for that, we derive three different metrics. In all these cases,
the value of the metric is the solution of a linear or quadratic program.

Clearly, our approach is more general and abstract than the other definitions
of phylogenetic metrics discussed below. Those rely much more on
the internal structure of trees. Usually, more abstract approaches have more
potential to generalise and to adapt to special situations. For the present
situation, this still has to be worked out.

For mathematical reasons, it is very convenient to also include semimetrics
on the taxon set in the definition. This situation may occur in phylogeny
if we do not resolve the topology by all singleton splits, see for instance [30].

What about other phylogenetic metrics? The simplest one, though not
the oldest one, seems to be the Robinson-Foulds distance [23ESJ. It is
easy to compute, in linear time H2 or even approximately in sublinear
time. However, it has little power to discriminate trees, since many
trees with similar biological meaning have distance equal to the diameter of
the unweighted tree space. Much nearer to biology seems to be a variant of
the Robinson-Foulds distance, the weighted matching distance. It captures
similarity of splits, which carries a lot of biological information, and is still
computable in subcubic time [31 [22].

A quite natural, biologically motivated way of capturing tree similarity is
provided by the tree rearrangement metrics. There are different basic
transformations giving rise to the NNI-distance [2B], the SPR-distance and the
TBR-distance. Unfortunately, computation of those distances is NP-hard and only
feasible for small trees UDim®. A fixed-parameter approach to computing
the (rooted) SPR-distance, for example, was taken in [32]. Even closer to the heart of evolution
is the maximum parsimony distance [IB]. Still, it is NP-hard to compute that
distance, even over binary unweighted phylogenetic trees [IB] 120] .

A good alternative to the tree rearrangement metrics is the quartet
distance D3- It is much more biologically plausible than the Robinson-Foulds
distance and also efficiently computable (ij.

For weighted phylogenetic trees there is the Euclidean-type (geodesic)
distance on tree space introduced by [2]. The crucial observation was that
tree space is in a natural way a CAT(0) space (a space
of nonpositive curvature), a notion going back to Gromov. Essentially, this property
implies uniqueness of geodesics. For some years it was an open problem how
to compute the geodesic distance on tree space efficiently. By [23], we
now have a polynomial time algorithm. The CAT(0) idea was used again in
HP to develop metrics for ultrametric spaces. Again, efficient computation
of the geodesics is possible for at least one of the metrics. As observed in that
work, different but natural parametrisations may yield different geodesics.

Recently, [21] returned to the idea of [33], [27] and [2] in application
to weighted rooted trees, considering all distances of MRCAs of pairs of taxa
to the root. She also proposes to weight different MRCAs by their depth
relative to the root. That idea may be similar to the weighted matching
distance [3] 122].

A good review of recent developments in polynomial time computable
metrics on unweighted phylogenetic trees is contained in [1]. Complete
Java implementations are provided there as well. For simplicity, we implemented
our metrics in R first.

After this short overview, we would like to
introduce the notion of a biocomputable metric. That should be a metric on
phylogenetic tree space which is computable in polynomial time and which
is able to capture biological similarity. Preferably, it should also be defined
for weighted phylogenetic trees. So, let us see how Gromov's idea of joint
embeddings helps to reach that goal.

2 Definition 

For a set $X$ denote by $M(X)$ the set of all semimetrics on $X$, i.e. all
$\rho : X \times X \to \mathbb{R}_{\ge 0}$ such that for all $x,y,z \in X$: $\rho(x,x) = 0$, $\rho(x,y) = \rho(y,x)$ and
$\rho(x,y) \le \rho(x,z) + \rho(z,y)$. Frequently, we describe such a semimetric in an
equivalent fashion by $\rho : \binom{X}{2} \to \mathbb{R}_{\ge 0}$, where $\binom{X}{2} = \{\{x,y\} : x,y \in X, x \ne y\}$.
Accordingly, $M_{>0}(X)$ denotes the set of all metrics on $X$. Further, let
$\mathcal{M} = \{(X,\rho) : \#X < \infty, \rho \in M(X)\}$ denote the set of all finite semimetric spaces.
Isometries $\varphi : (X,\rho) \to (Y,\rho')$ preserve the semimetrics, i.e. for all $x,y \in X$
$\rho(x,y) = \rho'(\varphi(x),\varphi(y))$.
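The axioms above are easy to check mechanically on a finite set. The following is a minimal sketch, assuming the semimetric is stored as a nested dictionary `rho[x][y]`; the function names and the storage format are illustrative choices, not part of the paper.

```python
from itertools import product

def is_semimetric(points, rho, tol=1e-12):
    """Check rho(x,x) = 0, nonnegativity, symmetry and the triangle inequality."""
    for x in points:
        if abs(rho[x][x]) > tol:
            return False
    for x, y in product(points, repeat=2):
        if rho[x][y] < -tol or abs(rho[x][y] - rho[y][x]) > tol:
            return False
    for x, y, z in product(points, repeat=3):
        if rho[x][y] > rho[x][z] + rho[z][y] + tol:
            return False
    return True

def is_metric(points, rho, tol=1e-12):
    """A metric is a semimetric in which distinct points have positive distance."""
    return is_semimetric(points, rho, tol) and all(
        rho[x][y] > tol for x in points for y in points if x != y
    )
```

A semimetric that identifies two labels (distance zero between distinct points) passes the first check but not the second, which is the situation of a non-injective labelling.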

Frequently we need identical copies of our taxon set $X$. With a slight abuse
of notation, we will denote them $X' = \{x' : x \in X\}$ and $X'' = \{x'' : x \in X\}$.

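Definition 1. Let $X$ be a finite set. Then we define three functions $D_1$, $D_2$, $D_\infty$
on $M(X) \times M(X)$ by

$$D_1(\rho,\rho') = \inf_{Y,\varphi,\psi} \sum_{x \in X} d(\varphi(x),\psi(x))$$

$$D_2(\rho,\rho')^2 = \inf_{Y,\varphi,\psi} \sum_{x \in X} d(\varphi(x),\psi(x))^2$$

$$D_\infty(\rho,\rho') = \inf_{Y,\varphi,\psi} \max_{x \in X} d(\varphi(x),\psi(x))$$

where the infimum is taken over all $(Y,d) \in \mathcal{M}$ and all isometries $\varphi :
(X,\rho) \to (Y,d)$, $\psi : (X,\rho') \to (Y,d)$.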

Remark 1. $D_\infty$ is nearest to the Gromov-Hausdorff distance, which we
should implement via

$$D_{GH}(\rho,\rho') = \inf_{Y,\varphi,\psi} \max_{x \in X} \min_{y \in X} d(\varphi(x),\psi(y)). \tag{2}$$

On the other hand, we think that the $\ell^1$-like metric $D_1$ is kind of natural
for trees. The Euclidean geometry which is the basis of $D_2$ might be good for
having unique geodesics. This feature is very convenient and at the heart of 
the proposals of and m- 

Let us simplify the optimisation problems present in the definitions of $D_i$
a bit. In fact, it is enough to have just one model space $Y$. For $\rho, \rho' \in M(X)$
define the space $E(\rho,\rho')$ of their extensions

$$E(\rho,\rho') = \{d \in M(X \cup X') : \forall x,y \in X : d(x,y) = \rho(x,y),\ d(x',y') = \rho'(x,y)\}.$$

Further, $\|\cdot\|_i$ denotes the usual $\ell^i$-norm on $\mathbb{R}^X$.

Lemma 1. For $i = 1, 2, \infty$

$$D_i(\rho,\rho') = \inf_{d \in E(\rho,\rho')} \|(d(x,x'))_{x \in X}\|_i \tag{3}$$

Proof. Note that $\le$ holds trivially.
On the other side, for $(Y,d) \in \mathcal{M}$ and isometries $\varphi : (X,\rho) \to (Y,d)$,
$\psi : (X,\rho') \to (Y,d)$ define $\tilde d : \binom{X \cup X'}{2} \to \mathbb{R}_{\ge 0}$ by

$$\tilde d(x,y) = \rho(x,y) = d(\varphi(x),\varphi(y))$$
$$\tilde d(x',y') = \rho'(x,y) = d(\psi(x),\psi(y))$$
$$\tilde d(x,y') = d(\varphi(x),\psi(y))$$

for all $x,y \in X$. Now $d \in M(Y)$ implies $\tilde d \in M(X \cup X')$. The $\ge$ in (3)
follows now from

$$\|(\tilde d(x,x'))_{x \in X}\|_i = \|(d(\varphi(x),\psi(x)))_{x \in X}\|_i.$$

□

Observe that the previous lemma is at the heart of the computation of the 
distances since that amounts just to the minimization of a convex function 
over the convex set E(p,p'). 

Lemma 2. For $i = 1, 2, \infty$ there exists a $d^* \in E(\rho,\rho') \subset M(X \cup X')$ such
that

$$D_i(\rho,\rho') = \|(d^*(x,x'))_{x \in X}\|_i$$

Proof. Clearly, the sublevel sets of the convex function $d \mapsto \|(d(x,x'))_{x \in X}\|_i$ are compact on
the convex set $E(\rho,\rho')$. Thus there must exist a minimal point of that
function. □

Theorem 1. $D_i$, $i = 1, 2, \infty$, are complete metrics on $M(X)$.

Proof. Symmetry is clear.

If $D_i(\rho,\rho') = 0$ choose $d^* \in E(\rho,\rho')$ according to the previous lemma.
Obviously, we obtain $d^*(x,x') = 0$ for all $x \in X$. The triangle inequality
implies for all $x,y \in X$

$$\rho(x,y) = d^*(x,y) = d^*(x',y') = \rho'(x,y)$$

such that $\rho = \rho'$.

Now let $\rho, \rho', \rho'' \in M(X)$ and $i$ be arbitrary. Using again the above
lemma we choose $d_1 \in M(X \cup X')$ extending $\rho, \rho'$ and $d_2 \in M(X' \cup X'')$
extending $\rho', \rho''$ such that

$$D_i(\rho,\rho') = \|(d_1(x,x'))_{x \in X}\|_i$$
$$D_i(\rho',\rho'') = \|(d_2(x',x''))_{x \in X}\|_i$$

Then we find, following [2] or Lemma [TUI, some $d \in M(X \cup X' \cup X'')$ extending
both $d_1, d_2$:

$$d|_{X \cup X'} = d_1 \quad\text{and}\quad d|_{X' \cup X''} = d_2.$$

We see now

$$D_i(\rho,\rho'') \le \|(d(x,x''))_{x \in X}\|_i \le \|(d(x,x') + d(x',x''))_{x \in X}\|_i$$
$$\le \|(d(x,x'))_{x \in X}\|_i + \|(d(x',x''))_{x \in X}\|_i$$
$$= \|(d_1(x,x'))_{x \in X}\|_i + \|(d_2(x',x''))_{x \in X}\|_i$$
$$= D_i(\rho,\rho') + D_i(\rho',\rho'').$$

Completeness will be proved later in Lemma 8. □

As already said in the introduction, we are mainly interested in metrics
on tree space. Let $G = (V,E,q)$ be a weighted connected graph, i.e. $E \subseteq \binom{V}{2}$
and $q : E \to \mathbb{R}_{\ge 0}$. Then we define the induced semimetric on $V$ by

$$d_G(x,y) = \inf\{\mathrm{len}(p) : p \text{ path from } x \text{ to } y\} \tag{4}$$

As usual,

$$\mathrm{len}(x_0 x_1 \dots x_m) = \sum_{i=1}^{m} q(\{x_{i-1},x_i\})$$

is here the length of the path $(x_0 x_1 \dots x_m)$. For unweighted graphs $(V,E)$
we choose $q(\{x,y\}) = 1$ for all $\{x,y\} \in E$.
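For a finite graph, the infimum over paths in (4) is a shortest-path computation. The sketch below uses the Floyd-Warshall algorithm; this is one standard way to obtain all pairwise distances, not necessarily what the authors' R code does, and the function name and the edge dictionary format are assumptions made for illustration.

```python
import math

def induced_semimetric(vertices, weighted_edges):
    """All-pairs shortest path lengths of an undirected weighted graph.

    weighted_edges maps pairs (u, v) to a nonnegative edge weight q({u, v}).
    """
    d = {u: {v: (0.0 if u == v else math.inf) for v in vertices} for u in vertices}
    for (u, v), w in weighted_edges.items():
        d[u][v] = min(d[u][v], w)
        d[v][u] = min(d[v][u], w)
    # Floyd-Warshall relaxation: allow k as an intermediate vertex.
    for k in vertices:
        for u in vertices:
            for v in vertices:
                if d[u][k] + d[k][v] < d[u][v]:
                    d[u][v] = d[u][k] + d[k][v]
    return d

# Hypothetical example: unweighted quartet tree with cherries {A, B} and {C, D}.
vertices = ["A", "B", "C", "D", "u", "v"]
edges = {("A", "u"): 1, ("B", "u"): 1, ("u", "v"): 1, ("C", "v"): 1, ("D", "v"): 1}
d_tree = induced_semimetric(vertices, edges)
```

Restricting `d_tree` to the leaf labels gives the semimetric on the taxon set that the tree induces.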

So let the tree space $T(X)$ be the set of all weighted unrooted generalised
phylogenetic $X$-trees. A weighted unrooted generalised phylogenetic $X$-tree
is a quadruple $(V,E,q,\varphi)$, where $\varphi : X \to V$ is a (not necessarily injective)
map such that $(V,E)$ is the minimal tree containing $\varphi(X)$ and $q : E \to \mathbb{R}_{\ge 0}$
is a weight function. Phylogenetic $X$-trees without weights are included by
giving all edges after contraction a weight of 1 and by requiring $\varphi$ to be
injective. The corresponding subspace will be denoted $T_1(X)$. The set of
binary (bifurcating) phylogenetic $X$-trees is denoted $T_1^2(X)$. Now we define
for $\tau, \tau' \in T(X)$ under abuse of notation

$$D_i(\tau,\tau') = D_i(d_\tau|_X, d_{\tau'}|_X)$$

where $\rho = d_\tau|_X \in M(X)$ is induced by the tree $\tau$ and $\rho' = d_{\tau'}|_X$ by $\tau'$ via (4). Again, all
three are metrics on tree space. This can be seen from the following result,
provided in essence by [7].
Lemma 3. For $\rho \in M(X)$ there exists an unrooted generalised phylogenetic
$X$-tree $\tau \in T(X)$ with $\rho = d_\tau|_X$ if and only if for all $x,y,z,w \in X$ the
four point condition

$$\rho(x,y) + \rho(z,w) \le \max(\rho(x,z) + \rho(y,w),\ \rho(x,w) + \rho(y,z)) \tag{5}$$

is fulfilled.

Proof. Identifying points $x,y \in X$ with $\rho(x,y) = 0$ we can assume that $\rho$ is
a metric. That (5) is then necessary and sufficient for the existence of $\tau$ was
shown in [7]. The splits of $\tau$ are identified by situations where in (5) strict
inequality holds. Minimality of the vertex set of $\tau$ (according to the definition)
implies that different edges in $\tau$ induce different splits. The weight of the
edge corresponding to a split by (5) can be computed directly from the difference
of the right and the left hand side in (5). Thus $\tau \in T(X)$ is uniquely
determined. □
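The four point condition (5) can be tested directly on a distance table. For a semimetric it suffices to look at sets of four distinct points, and requiring (5) for every labelling of such a set is the same as requiring that the two largest of the three pair sums coincide. The sketch below uses this formulation; the helper names and the pair-keyed dictionary are assumptions for illustration.

```python
from itertools import combinations

def dist(rho, x, y):
    """Look up a symmetric distance stored under either (x, y) or (y, x)."""
    if x == y:
        return 0
    return rho[(x, y)] if (x, y) in rho else rho[(y, x)]

def satisfies_four_point(points, rho, tol=1e-9):
    """True if the two largest of the three pair sums agree for every 4-subset."""
    for x, y, z, w in combinations(points, 4):
        sums = sorted([
            dist(rho, x, y) + dist(rho, z, w),
            dist(rho, x, z) + dist(rho, y, w),
            dist(rho, x, w) + dist(rho, y, z),
        ])
        if sums[2] - sums[1] > tol:
            return False
    return True

# Hypothetical examples: leaf distances of a unit-edge quartet tree AB|CD,
# and the shortest-path metric of a 4-cycle a-b-c-d-a (not a tree metric).
quartet = {("A", "B"): 2, ("C", "D"): 2, ("A", "C"): 3,
           ("A", "D"): 3, ("B", "C"): 3, ("B", "D"): 3}
cycle = {("a", "b"): 1, ("b", "c"): 1, ("c", "d"): 1,
         ("a", "d"): 1, ("a", "c"): 2, ("b", "d"): 2}
```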

Let us compute an example.

Example 1. We want to compare for $X = \{A,B,C,D\}$ the two unweighted
$X$-trees

[Figure: the two four-leaf trees $\tau$ and $\tau'$]

with corresponding distances $\rho$, $\rho'$.

We want to derive possible extensions of $\rho$, $\rho'$ by verifying that for some
$\delta_A, \delta_B, \delta_C, \delta_D \ge 0$ the graph distances on the weighted graph

[Figure: weighted graph $G$ with edge lengths $\delta_A, \delta_B, \delta_C, \delta_D$]

reproduce both $\rho$ and $\rho'$. One obvious choice is $\delta_A = 0$, $\delta_B = 1$, $\delta_C = 1$, $\delta_D = 0$,
i.e.

[Figure: the graph $G$ for this choice]

is consistent. Obviously, we have now embedded both $\tau$ and $\tau'$ into the metric space
of the graph $G$. We see $D_\infty \le 1$, $D_2 \le \sqrt{2}$ and $D_1 \le 2$. In fact equality
holds, but this we can prove only later in Example 2.

Additionally, we obtain

Lemma 4. For $\lambda > 0$, $i = 1, 2, \infty$, and $\rho_j \in M(X)$, $j = 1, 2, 3, 4$, the
following are true:

1. $D_i(\lambda\rho_1, \lambda\rho_2) = \lambda D_i(\rho_1, \rho_2)$.

2. $D_i(\rho_1 + \rho_3, \rho_2 + \rho_4) \le D_i(\rho_1, \rho_2) + D_i(\rho_3, \rho_4)$.

3. $D_i(\lambda\rho_1 + (1-\lambda)\rho_2, \rho_3) \le \lambda D_i(\rho_1, \rho_3) + (1-\lambda) D_i(\rho_2, \rho_3)$.

Proof. The first relation follows from $\lambda d \in E(\lambda\rho_1, \lambda\rho_2) \iff d \in E(\rho_1, \rho_2)$.

The second relation follows from $d_1 + d_2 \in E(\rho_1 + \rho_3, \rho_2 + \rho_4)$ for all
$d_1 \in E(\rho_1, \rho_2)$ and $d_2 \in E(\rho_3, \rho_4)$.

The third relation is just a consequence of the first two. □


3 Efficient Computation 

Clearly, 

Lemma 5. $D_1$ and $D_\infty$ can be computed by solving a linear program. For the
computation of $D_2$ a quadratic program has to be solved.

Proof. This follows immediately from Lemma 1. □

So, we are sure that we can compute the distances in a computing time
polynomially bounded in $n = \#X$ [19]. In the naive way, the linear (quadratic)
program has the $n^2$ variables $e_{xy} = d(x,y')$ and $O(n^3)$ constraints coming
essentially from the triangle inequalities in triangles of the form $x, y, z'$ or
similar. But we can do the computation more efficiently. The essential
observation is that the objective function depends on the unknown values
$(d(x,x'))_{x \in X}$ only. The reformulation of the constraints is provided by the
following theorem. It will be proved later in section A.

Theorem 2 (quadrangle inequalities). Let $\rho, \rho' \in M(X)$ and $(\delta_x)_{x \in X} \in \mathbb{R}^X_{\ge 0}$
be given. Then there exists a $d \in M(X \cup X')$ with

$$d(x,y) = \rho(x,y), \quad x,y \in X$$
$$d(x',y') = \rho'(x,y), \quad x,y \in X$$
$$d(x,x') = \delta_x, \quad x \in X$$

if and only if for all $x \ne y \in X$ the following inequalities are fulfilled:

$$\begin{aligned} \delta_x + \delta_y &\ge |\rho(x,y) - \rho'(x,y)| \\ |\delta_x - \delta_y| &\le \rho(x,y) + \rho'(x,y) \end{aligned} \tag{6}$$

Thus we have just $n$ variables $\delta_x = d(x,x')$ and $O(n^2)$ constraints, one pair for
each rectangle $x, y, y', x'$, in the optimisation problems (3). Formally, $D_i(\rho,\rho')$
solves the program

$$\|\delta\|_i \to \min \quad\text{under}\quad \begin{cases} \delta_x \ge 0 & x \in X \\ \delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)| & x \ne y \in X \\ |\delta_x - \delta_y| \le \rho(x,y) + \rho'(x,y) & x \ne y \in X \end{cases} \tag{7}$$
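Any vector that satisfies the quadrangle inequalities yields upper bounds on all three distances, so a feasibility check is a useful first step before calling an LP or QP solver. The following sketch only checks feasibility and evaluates the three norms; it does not solve the program. The function names are illustrative, and the two distance tables are an assumption: they are the leaf distances of the unit-edge quartet trees AB|CD and AC|BD.

```python
import math
from itertools import combinations

def is_feasible(delta, rho, rho2, tol=1e-9):
    """Check delta >= 0 and both quadrangle inequalities for every pair x != y."""
    if any(v < -tol for v in delta.values()):
        return False
    for x, y in combinations(sorted(delta), 2):
        if delta[x] + delta[y] < abs(rho[x, y] - rho2[x, y]) - tol:
            return False
        if abs(delta[x] - delta[y]) > rho[x, y] + rho2[x, y] + tol:
            return False
    return True

def norms(delta):
    """Return the l1, l2 and l-infinity norms of the vector delta."""
    vals = list(delta.values())
    return sum(vals), math.sqrt(sum(v * v for v in vals)), max(vals)

# Hypothetical data, keyed by alphabetically sorted pairs of labels.
rho = {("A", "B"): 2, ("A", "C"): 3, ("A", "D"): 3,
       ("B", "C"): 3, ("B", "D"): 3, ("C", "D"): 2}
rho2 = {("A", "B"): 3, ("A", "C"): 2, ("A", "D"): 3,
        ("B", "C"): 3, ("B", "D"): 2, ("C", "D"): 3}
delta = {"A": 0.0, "B": 1.0, "C": 1.0, "D": 0.0}
```

For this choice of `delta` the check succeeds and the norms are the upper bounds 2, $\sqrt{2}$ and 1 for the $\ell^1$, $\ell^2$ and $\ell^\infty$ cases.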

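Example 2. Let us continue Example 1. Since $\rho(A,D) = \rho'(A,D)$, we see
from the upper parts of (6)

$$\delta_A + \delta_B \ge 1$$
$$\delta_A + \delta_C \ge 1$$
$$\delta_B + \delta_D \ge 1$$
$$\delta_C + \delta_D \ge 1$$

Consequently,

$$D_1(\rho,\rho') = \delta_A + \delta_B + \delta_C + \delta_D \ge 2.$$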

We already saw that we can realise this minimum. The calculation of D oa (p, p') 
1 was already done, D 2 (p, p') = \/2 is immediate. 

It is very interesting that the upper bounds on the differences are not
used in the calculation. In fact, we could not observe any situation where
they had to be used to determine the minimum. This can also be seen from
the numerical results in section 7, especially Figure 6. But we are still
lacking a proof that we may safely omit these constraints. This leads us to
the definition of further distances $\tilde D_i(\rho,\rho')$ as solutions of

$$\|\delta\|_i \to \min \quad\text{under}\quad \begin{cases} \delta_x \ge 0 & x \in X \\ \delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)| & x \ne y \in X \end{cases} \tag{8}$$

with the obvious extension to tree space.

Lemma 6. $\tilde D_i$ are metrics on $M(X)$ and $T(X)$, too.

Proof. Observe that, exactly as for the problem (7), the minimum of (8)
is also attained.

Symmetry of the definition is clear. Further, $\tilde D_i(\rho,\rho') = 0$ if and only if
$\delta = 0$ is feasible for the problem (8). That means $\rho(x,y) = \rho'(x,y)$ for all
$\{x,y\} \in \binom{X}{2}$ and $\rho = \rho'$.

For the proof of the triangle inequality choose optimal solutions $\delta^1 \in \mathbb{R}^X_{\ge 0}$
of (8) and $\delta^2 \in \mathbb{R}^X_{\ge 0}$ of the version of (8) for $\rho', \rho''$. We see for $\{x,y\} \in \binom{X}{2}$
that

$$\delta^1_x + \delta^2_x + \delta^1_y + \delta^2_y \ge |\rho(x,y) - \rho'(x,y)| + |\rho'(x,y) - \rho''(x,y)| \ge |\rho(x,y) - \rho''(x,y)|$$

such that $\delta^1 + \delta^2$ is feasible for the version of (8) for $\rho, \rho''$. We obtain

$$\tilde D_i(\rho,\rho'') \le \|\delta^1 + \delta^2\|_i \le \|\delta^1\|_i + \|\delta^2\|_i = \tilde D_i(\rho,\rho') + \tilde D_i(\rho',\rho'').$$

This completes the proof. □

Remark 2. Interestingly, there is a striking similarity between the feasible
set of (8) and the tight span of a distance matrix introduced in m. Yet,
$|\rho - \rho'|$ is not a semimetric in general and we do not see a deeper connection
at the moment.


4 Comparison to other metrics 


First we compare our metrics to the pathwise difference metrics. Recall that
those are defined by [33, 27]

$$D_i^{PD}(\tau_1,\tau_2) = \left\| \left( \rho_{\tau_1}(x,y) - \rho_{\tau_2}(x,y) \right)_{\{x,y\} \in \binom{X}{2}} \right\|_i$$

Interestingly, it seems that $D_\infty^{PD}$ was not used before. Maybe we can
immediately explain this. Again we abbreviate $n = \#X$.

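Theorem 3. For $\tau_1, \tau_2 \in T(X)$ it holds

$$\tilde D_1(\tau_1,\tau_2) \ge \tilde D_2(\tau_1,\tau_2) \ge \tilde D_\infty(\tau_1,\tau_2) \ge \tfrac{1}{\sqrt{n}} \tilde D_2(\tau_1,\tau_2) \ge \tfrac{1}{n} \tilde D_1(\tau_1,\tau_2)$$

$$D_1(\tau_1,\tau_2) \ge D_2(\tau_1,\tau_2) \ge D_\infty(\tau_1,\tau_2) \ge \tfrac{1}{\sqrt{n}} D_2(\tau_1,\tau_2) \ge \tfrac{1}{n} D_1(\tau_1,\tau_2)$$

and

$$\tfrac{n}{2} D_1^{PD}(\tau_1,\tau_2) \ge D_1(\tau_1,\tau_2) \ge \tilde D_1(\tau_1,\tau_2) \ge \tfrac{1}{n-1} D_1^{PD}(\tau_1,\tau_2)$$

$$\tfrac{\sqrt{n}}{2} D_2^{PD}(\tau_1,\tau_2) \ge D_2(\tau_1,\tau_2) \ge \tilde D_2(\tau_1,\tau_2) \ge \sqrt{\tfrac{1}{2(n-1)}}\, D_2^{PD}(\tau_1,\tau_2)$$

$$D_\infty(\tau_1,\tau_2) = \tilde D_\infty(\tau_1,\tau_2) = \tfrac{1}{2} D_\infty^{PD}(\tau_1,\tau_2)$$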
Proof. The first relations are well-known for the norms $\|\cdot\|_i$ and translate directly.

For the second relations we use the first inequality in (6). This gives us
for all $x \ne y \in X$

$$\delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)|$$

$$\delta_x^2 + \delta_y^2 \ge \tfrac{1}{2}(\delta_x + \delta_y)^2 \ge \tfrac{1}{2}|\rho(x,y) - \rho'(x,y)|^2$$

$$\max\{\delta_x : x \in X\} \ge \tfrac{1}{2}(\delta_x + \delta_y) \ge \tfrac{1}{2}|\rho(x,y) - \rho'(x,y)|$$

Summing up the first or the second inequalities for all $\{x,y\} \in \binom{X}{2}$ gives the
estimates for $i = 1, 2$.

The $\ge$-estimate for $i = \infty$ follows by taking the maximum of the third
inequality over all $\{x,y\} \in \binom{X}{2}$. On the other hand, setting

$$\delta_z = \max\left\{ \tfrac{1}{2}|\rho_{\tau_1}(x,y) - \rho_{\tau_2}(x,y)| : \{x,y\} \in \tbinom{X}{2} \right\}$$

for all $z \in X$, the constraints of (7) are clearly fulfilled and we obtain also the $\le$-estimate.

The first estimates yield the rest of the second estimates and complete
the proof. □
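The estimates in this proof translate into bounds that can be computed without any optimisation. Summing $\delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)|$ over all unordered pairs counts every $\delta_x$ exactly $n - 1$ times, and the squared inequality is summed in the same way; the constant vector from the last step gives the $\ell^\infty$ value. The sketch below evaluates these quantities. The function names and the three-point example are assumptions for illustration.

```python
import math
from itertools import combinations

def path_difference(points, rho, rho2):
    """l1, l2 and l-infinity norms of the vector of pairwise distance differences."""
    diffs = [abs(rho[x, y] - rho2[x, y]) for x, y in combinations(sorted(points), 2)]
    return sum(diffs), math.sqrt(sum(t * t for t in diffs)), max(diffs)

def bounds_from_path_difference(points, rho, rho2):
    """Lower bounds for the l1 and l2 programs and the value of the l-infinity one."""
    n = len(points)
    pd1, pd2, pdinf = path_difference(points, rho, rho2)
    return {
        "l1_lower": pd1 / (n - 1),                   # (n-1) * sum(delta) >= PD_1
        "l2_lower": pd2 / math.sqrt(2 * (n - 1)),    # (n-1) * sum(delta^2) >= PD_2^2 / 2
        "linf_value": pdinf / 2,                     # attained by a constant vector
    }

# Hypothetical three-point metrics, keyed by sorted pairs.
points = ["a", "b", "c"]
rho = {("a", "b"): 2, ("a", "c"): 3, ("b", "c"): 3}
rho2 = {("a", "b"): 4, ("a", "c"): 3, ("b", "c"): 5}
bounds = bounds_from_path_difference(points, rho, rho2)
```

In this example the differences are 2, 0 and 2, so the constant vector with entries 1 is feasible and realises the $\ell^\infty$ value.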

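By the same arguments as in Lemma 2 both (7) and (8) possess minimal
points $\delta^* \in \mathbb{R}^X_{\ge 0}$. As a corollary of the last theorem we find a useful upper
bound for the elements of these vectors:

Lemma 7. In the minimisation problems (7) or (8), we may restrict
minimisation to $\delta \in \mathbb{R}^X_{\ge 0}$ which fulfil additionally

$$\delta_x \le 2 D_\infty(\rho,\rho') = D_\infty^{PD}(\rho,\rho').$$

E.g., the minimisation problems

$$\|\delta\|_i \to \min \quad\text{under}\quad \begin{cases} 0 \le \delta_x \le 2 D_\infty(\rho,\rho') & x \in X \\ \delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)| & x \ne y \in X \\ |\delta_x - \delta_y| \le \rho(x,y) + \rho'(x,y) & x \ne y \in X \end{cases} \tag{9}$$

$$\|\delta\|_i \to \min \quad\text{under}\quad \begin{cases} 0 \le \delta_x \le 2 D_\infty(\rho,\rho') & x \in X \\ \delta_x + \delta_y \ge |\rho(x,y) - \rho'(x,y)| & x \ne y \in X \end{cases} \tag{10}$$

yield again $D_i(\rho,\rho')$ and $\tilde D_i(\rho,\rho')$ as minimal values, respectively.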
Proof. Define $\tilde\delta$ by $\tilde\delta_x = \min(\delta_x, 2 D_\infty(\rho,\rho'))$. By the above relation, $\tilde\delta$ is
again in the feasible set of (7) and (8), respectively. Further,
$\|\tilde\delta\|_i \le \|\delta\|_i$ completes the proof. □


Lemma 8. $M(X)$ and $T(X)$ are complete in each $D_i$, $i = 1, 2, \infty$.

Proof. Clearly, $M(X)$ is complete w.r.t. $D_\infty^{PD}$. Since all metrics on $M(X)$
are equivalent by Theorem 3, the same is true for $D_i$. On $T(X)$
we have to observe additionally that $T(X)$ is closed, since both sides of the
four point conditions (5) depend continuously on the metric. Then Lemma
3 implies completeness of $T(X)$. □


To show that the new metrics are biologically meaningful, we show that
they do not change much under NNI (nearest neighbour interchange)
operations. Such an operation is given by

[Figure: NNI operation on the subtrees $A$, $B$, $C$, $D$]

where $A, B, C, D$ denote different subtrees. The minimal number of NNI
operations to reach $\tau' \in T_1^2(X)$ from $\tau \in T_1^2(X)$ is the NNI-distance $D_{NNI}(\tau,\tau')$

m- 


Theorem 4. Consider $\tau, \tau' \in T_1^2(X)$ which differ by one NNI operation.
Then

Di(t,t') < n 
D 2 (t,t') < y/n 

Doo{t,t') = 1 

Especially, 

D nni (t,t') > > -^=D 2 (t,t') > -Di(r, r'). 

Jn n 


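Proof. Let $\tau$ and $\tau'$ be the two trees

[Figure: the trees $\tau$ and $\tau'$ before and after the NNI operation]

where $A, B, C, D$ are the four subtrees of $\tau, \tau'$ corresponding to a four-partition of
$X$.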
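Then we observe the following structure of the matrix $\Delta \in \mathbb{R}^{X \times X}_{\ge 0}$,
$\Delta_{x,y} = |\rho_\tau(x,y) - \rho_{\tau'}(x,y)|$, written in blocks with respect to $A, B, C, D$:

$$\Delta = \begin{pmatrix} 0 & 1 & 1 & 0 \\ 1 & 0 & 0 & 1 \\ 1 & 0 & 0 & 1 \\ 0 & 1 & 1 & 0 \end{pmatrix}$$

or more precisely

$$\Delta_{x,y} = \begin{cases} 1 & x \in A \cup D,\ y \in B \cup C \\ 1 & y \in A \cup D,\ x \in B \cup C \\ 0 & \text{otherwise} \end{cases}$$

The estimates are now immediate from Theorem 3. □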

Remark 3. Similar estimates could be done for the SPR-metrics. By w 
this has natural implications for the TBR-metrics, too. Further we see that
the size of the 1-neighbourhood of a tree $\tau \in T_1^2(X)$ in the $D_\infty$-metric is at
least $n - 3$.

How large are those bounds compared to the diameter of the space $T_1^2(X)$?
We have some crude estimates:


Lemma 9. For all $\tau_1, \tau_2 \in T_1(X)$ it holds

$$D_1(\tau_1,\tau_2) \le n \cdot \tfrac{n-2}{2}$$

$$D_2(\tau_1,\tau_2) \le \sqrt{n} \cdot \tfrac{n-2}{2}$$

$$D_\infty(\tau_1,\tau_2) \le \tfrac{n-2}{2}$$

Proof. $D_\infty(\tau_1,\tau_2) \le \frac{n-2}{2}$ follows immediately from $D_\infty^{PD}(\tau_1,\tau_2) \le n - 2$, which
holds since all paths in $\tau_1, \tau_2$ have at least one and at most $(n-1)$ edges.
Theorem 3 implies the other two inequalities, and the estimate on the NNI-metric
is an immediate consequence of its definition. □

Now we want to show that there are trees such that the distance between
them is of the same order in $n$.

Lemma 10. Let $n = 4m + 1$ for some $m \in \mathbb{N}$, $m > 1$, and
$X = \{1, \dots, 4m + 1\}$. Suppose $\tau$ is the unrooted caterpillar tree with cherries
$\{1, 2\}$ and $\{4m, 4m + 1\}$:

[Figure: caterpillar tree $\tau$ with leaves $1, 2, 3, \dots, 4m, 4m + 1$]

and $\tau'$ is obtained from $\tau$ by reversing the order of the even labels, i.e. $2i$ is
interchanged with $2(2m + 1 - i)$ for $i = 1, \dots, 2m$:

[Figure: caterpillar tree $\tau'$ with the even labels in reversed order]

Then

$$D_1(\tau,\tau') \ge \tilde D_1(\tau,\tau') \ge 4m^2 - 4m + 2$$

$$D_2(\tau,\tau') \ge \tilde D_2(\tau,\tau') \ge \sqrt{\tfrac{16}{3}m^3 - 8m^2 + \tfrac{32}{3}m - 6}$$

$$D_\infty(\tau,\tau') = 2m - 1$$


Proof. It is easy to see that for 1 < i < j < n = Am + 1 

{ j i = 1, 2, j < Am 

Am i = 1,2, j = Am + 1 

Am + 1 — i 3 < i, j = Am, Am + 1 

j — i + 2 otherwise 

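First, the formula for $D_\infty(\tau,\tau')$ follows immediately from Theorem 3.
Continuing, we obtain from (8) the following constraints

$$\begin{aligned} \delta_1 + \delta_2 &\ge 4m - 2 \\ \delta_3 + \delta_4 &\ge 4m - 8 \\ \delta_5 + \delta_6 &\ge 4m - 12 \\ &\ \vdots \\ \delta_{2m-1} + \delta_{2m} &\ge 0 \\ &\ \vdots \\ \delta_{4m-3} + \delta_{4m-2} &\ge 4m - 8 \\ \delta_{4m-1} + \delta_{4m} &\ge 4m - 4 \end{aligned}$$

Summing up these constraints directly gives the lower bound for $\tilde D_1(\tau,\tau')$.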


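Similarly, $\delta_x^2 + \delta_y^2 \ge \frac{1}{2}|\rho(x,y) - \rho'(x,y)|^2$ gives us

$$\begin{aligned} \delta_1^2 + \delta_2^2 &\ge 2(2m - 1)^2 \\ \delta_3^2 + \delta_4^2 &\ge 2(2m - 4)^2 \\ \delta_5^2 + \delta_6^2 &\ge 2(2m - 6)^2 \\ &\ \vdots \\ \delta_{2m-1}^2 + \delta_{2m}^2 &\ge 0 \\ &\ \vdots \\ \delta_{4m-3}^2 + \delta_{4m-2}^2 &\ge 2(2m - 4)^2 \\ \delta_{4m-1}^2 + \delta_{4m}^2 &\ge 2(2m - 2)^2 \end{aligned}$$

Again, summing up yields the lower bound for $\tilde D_2(\tau,\tau')$.

□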
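The two summations in this proof can be checked numerically. The sketch below assumes that the elided constraints follow the visible pattern: the pair $(1,2)$ has right-hand side $4m - 2$, the pairs $(2k-1, 2k)$ have $4(m - k)$ for $k = 2, \dots, m$ and $4(k - m - 1)$ for $k = m + 1, \dots, 2m$. Under that reading, the sum of the right-hand sides and half the sum of their squares are compared with the closed forms $4m^2 - 4m + 2$ and $\frac{16}{3}m^3 - 8m^2 + \frac{32}{3}m - 6$. The function names are illustrative.

```python
from fractions import Fraction

def pair_right_hand_sides(m):
    """Right-hand sides of the constraints for the pairs (2k-1, 2k), k = 1..2m.

    Assumption: the pattern is read off the listed constraints.
    """
    rhs = [4 * m - 2]
    rhs += [4 * (m - k) for k in range(2, m + 1)]
    rhs += [4 * (k - m - 1) for k in range(m + 1, 2 * m + 1)]
    return rhs

def summed_lower_bounds(m):
    """Lower bound for the l1 value and for the squared l2 value."""
    rhs = pair_right_hand_sides(m)
    l1_bound = sum(rhs)
    l2_squared_bound = Fraction(sum(r * r for r in rhs), 2)
    return l1_bound, l2_squared_bound

def closed_forms(m):
    """Closed-form expressions for the two sums as functions of m."""
    l1 = 4 * m * m - 4 * m + 2
    l2_squared = Fraction(16, 3) * m**3 - 8 * m**2 + Fraction(32, 3) * m - 6
    return l1, l2_squared
```

Exact rational arithmetic via `Fraction` avoids rounding in the cubic expression.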


Remark 4. Using the results from the next section and computations similar
to those in the section after it, we could derive the same order of magnitude of $D_i$
for general $n$.


5 Local Properties 

From Lemma 4 we obtain for "small" semimetrics $\rho', \rho'' \in M(X)$ immediately
that

$$D_i(\rho + \rho', \rho + \rho'') \le D_i(\rho', \rho'').$$

Notably, we can even sharpen this estimate:

Lemma 11. For all $\rho \in M_{>0}(X)$ there is an $\varepsilon > 0$ such that for $\rho', \rho'' \in M(X)$
with $D_\infty(0,\rho'), D_\infty(0,\rho'') < \varepsilon$

$$D_i(\rho + \rho', \rho + \rho'') = \tilde D_i(\rho', \rho'')$$


Remark 5. For $\tau \in T(X)$ the condition $\rho \in M_{>0}(X)$ just means that the
labelling is injective. Thus it is weaker than to say that $\tau$ is an inner point
of some orthant of tree space as considered in meaning the tree is binary 
and all edge lengths are positive. 

Further, this result is another proof that the $\tilde D_i$ are really metrics, see
Lemma 6.

In the following, let $0 \in M(X)$ denote the zero semimetric on $X$.

Proof of Lemma 11. By Lemma 7 we may add the constraints $\delta_x \le 2 D_\infty(\rho + \rho', \rho + \rho'')
= 2 D_\infty(\rho', \rho'')$ to (7) and (8) to get problems (9) and (10), respectively.

Now it is easy to derive that for

$$\varepsilon = \tfrac{1}{2} \min\left\{ \rho(x,y) : \{x,y\} \in \tbinom{X}{2} \right\}$$

and $\rho', \rho'' \in M(X)$ with $D_\infty(0,\rho'), D_\infty(0,\rho'') < \varepsilon$, the constraints

$$|\delta_x - \delta_y| \le 2\rho(x,y) + \rho'(x,y) + \rho''(x,y)$$

are automatically fulfilled. Removing them yields problem (10). □
Example 3. So it is interesting to ask for $D_i(0, \tau^l_{A,B})$ for a very simple $\tau$;
we choose $\tau^l_{A,B} = [A] - [B]$ where $A|B$ is a split of $X$ and $l$ is the length of this
split.

We see that the constraints from (8) turn into

$$\delta_x \ge 0 \quad (x \in X), \qquad \delta_x + \delta_y \ge l \quad (x \in A,\ y \in B)$$

Now (8) is symmetric under permutations of $A$ and under permutations of
$B$. Thus we may simply assume that

$$\delta_x = \begin{cases} a & x \in A \\ b & x \in B \end{cases}$$

for some $a, b \in \mathbb{R}_{\ge 0}$ with $a + b \ge l$.

For computing $\tilde D_1$, we find

$$\|\delta\|_1 = \#A\,a + \#B\,b \ge \#A\,a + \#B\,(l - a).$$

The latter function of $a$ has minimum $\tilde D_1(0, \tau^l_{A,B}) = \min(\#A, \#B)\,l$.
Similarly we find for $\tilde D_2$

$$\|\delta\|_2^2 = \#A\,a^2 + \#B\,b^2 \ge \#A\,a^2 + \#B\,(l - a)^2.$$

Now the minimum is $\tilde D_2(0, \tau^l_{A,B}) = \sqrt{\frac{\#A\,\#B}{\#A + \#B}}\,l$.

In summary, we observe that different splits of a tree get different
weights.

Moreover, we see that the minimal points $\delta^*$ fulfil all constraints in (7).
This shows $D_i = \tilde D_i$. Further, the same computations are valid if we compute
$D_i(\tau^l_{A,B}, \tau^{l'}_{A,B})$ with $|l - l'|$ replacing $l$:

$$D_1(\tau^l_{A,B}, \tau^{l'}_{A,B}) = \min(\#A, \#B)\,|l - l'|$$

$$D_2(\tau^l_{A,B}, \tau^{l'}_{A,B}) = \sqrt{\frac{\#A\,\#B}{\#A + \#B}}\,|l - l'|$$

Example 4. We want to compute $D_i(0, \tau^{l,l'}_{A,B,C})$ for

$$\tau^{l,l'}_{A,B,C} = [A] - [B] - [C]$$

with edge lengths $l$ and $l'$. This tree is the essence of two trees with the same shape
but differing in the lengths of two edges.
Again, by symmetry we only have to consider

$$\delta_x = \begin{cases} a & x \in A \\ b & x \in B \\ c & x \in C \end{cases}$$

for some $a, b, c \in \mathbb{R}_{\ge 0}$ which fulfil now

$$a + b \ge l, \qquad b + c \ge l', \qquad a + c \ge l + l' \tag{11}$$

which gives us a linear or quadratic program in $\mathbb{R}^3_{\ge 0}$.
For computing $D_1$, we want

$$\#A\,a + \#B\,b + \#C\,c \to \min$$

on this set. We know that this minimum is achieved in a corner of the
feasible set. But we see easily that not all inequalities in (11) can be equalities
unless $b = 0$. Thus at least one of $a, b, c$ must be zero and we obtain the
minimal value as

$$\min\{\#A\,l + \#C\,l',\ (\#B + \#C)\,l + \#C\,l',\ \#A\,l + (\#A + \#B)\,l'\}$$

A distinction of cases whether $\#A \lessgtr \#B + \#C$ and $\#C \lessgtr \#A + \#B$ gives
us in any case one of the values as minimum. Thus in any case, $D_1(0, \tau^{l,l'}_{A,B,C})$
is a linear combination of $l$ and $l'$, i.e. some weighted $\ell^1$-distance.
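The corner argument can be checked by brute force. The sketch below evaluates the objective at the three candidate corners ($b = 0$, $a = 0$, $c = 0$) and compares the smallest value with a grid search over the feasible set $a + b \ge l$, $b + c \ge l'$, $a + c \ge l + l'$. The third corner value is derived here from the symmetry $A \leftrightarrow C$, $l \leftrightarrow l'$; the function names, the grid resolution and the sample cardinalities are assumptions for illustration.

```python
import math

def corner_values(na, nb, nc, l, lp):
    """Objective na*a + nb*b + nc*c at the three candidate corners."""
    return (
        na * l + nc * lp,            # b = 0: a = l,      c = l'
        (nb + nc) * l + nc * lp,     # a = 0: b = l,      c = l + l'
        na * l + (na + nb) * lp,     # c = 0: a = l + l', b = l'
    )

def grid_minimum(na, nb, nc, l, lp, steps=60):
    """Brute-force minimum of the objective over a grid of feasible (a, b, c)."""
    top, eps, best = l + lp, 1e-9, math.inf
    grid = [top * k / steps for k in range(steps + 1)]
    for a in grid:
        for b in grid:
            for c in grid:
                if a + b >= l - eps and b + c >= lp - eps and a + c >= l + lp - eps:
                    best = min(best, na * a + nb * b + nc * c)
    return best
```

With cardinalities 1, 2, 3 and lengths $l = 1$, $l' = 2$ the corner values are 7, 11 and 7, and the grid search returns the same minimum.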

The computation of $D_2$ would mean solving the quadratic program

$$\#A\,a^2 + \#B\,b^2 + \#C\,c^2 \to \min$$

For this problem, we only know that the solution is the projection of the
null vector onto the affine subspace determined by some face of the
feasible set. This projection is linear in $l$ and $l'$. This means that $D_2$ is the
minimum of several quadratic functions in $l, l'$. Since the algebra is rather
tedious we stop here with the indication that this minimum is just a
single quadratic function similar to the linear case before. A numerical test for
several cardinalities and random lengths $l, l'$ provided in Figure 1 shows that
the parallelogram equality is fulfilled in all considered situations. Thus the
local geometry seems to be Euclidean. This was our original expectation when
we introduced $D_2$. But even if this were true in general, we are already
assured by the previous example that we do not compute the geodesic metric
from
Figure 1: Test of the parallelogram equality for random lengths $l, l'$ and
$\#A = \#B = \#C = 1$ (above left), $\#A = 1$, $\#B = 2$, $\#C = 3$ (above
right). On the $x$-axis $D_2(0, \tau^{l_1,l_1'}_{A,B,C})^2 + D_2(0, \tau^{l_2,l_2'}_{A,B,C})^2$ is presented; on the
$y$-axis the other side of the parallelogram equality is plotted. Below, the curves
$l \mapsto D_2(0, \tau^{l,l'}_{A,B,C})^2$ for different scenarios on $\#A$, $\#B$, $\#C$ are plotted.
6 Monotony

For any $X_0$-tree $\tau$ let $\tau|_X$ denote the restriction to $X \subseteq X_0$. Observe that
for $\tau \in T_1(X_0)$ in general $\tau|_X \notin T_1(X)$.

Lemma 12. Let $X_0 \supseteq X$ and $\tau, \tau' \in T(X_0)$. Then for $i = 1, 2, \infty$

$$D_i(\tau,\tau') \ge D_i(\tau|_X, \tau'|_X)$$

$$\tilde D_i(\tau,\tau') \ge \tilde D_i(\tau|_X, \tau'|_X)$$

Proof. This follows immediately from the same inequalities for semimetrics
on $X_0$. Restricting $d^* \in E(\rho,\rho')$ from Lemma 2 to $X \cup X'$ yields an
element of $E(\rho|_X, \rho'|_X)$. Moreover,

$$\|(\delta_x)_{x \in X_0}\|_i \ge \|(\delta_x)_{x \in X}\|_i$$

for $\delta \in \mathbb{R}^{X_0}_{\ge 0}$ completes the calculation. □

Remark 6. This result naturally holds for many other phylogenetic metrics:
for the pathwise difference, NNI-, SPR-, TBR- and maximum parsimony
metrics, for example. For the tree rearrangement metrics it was shown in JJ\
Lemma 2.2].

7 Implementation and numerical examples 

The different metrics were implemented in R [35]. For solving linear
and quadratic programs the glpkAPI library [38] and the quadprog library
ra were used, respectively. The corresponding R script can be downloaded
from the website [5D]. Some testing showed the best performance in terms of
computing time for the dual simplex algorithm in the G-case. The
computing time for obtaining the distance between random trees of size 100 was
around 0.3 s, which is quite reasonable, see Figure 2. It is also comparable with the
computing time of the geodesic distance. The random trees were generated
by the function rtree of the R library phangorn [36].

We also compared $D_i$ and $\tilde D_i$ with several other phylogenetic metrics,
essentially the pathwise difference, the geodesic distance and the Robinson-Foulds
metric, for $n = 10$ leaves. For the computation of the geodesic (BHV)
metric the R package distory [39] was used. The results are presented in
Figure 3. Numerically, we could observe $D_i = \tilde D_i$ in all cases, see Figure
6 at the end of the paper. A remarkable correlation between the different
Gromov-type and the pathwise difference metrics can be observed. There is
not much correlation with the geodesic distance. Maybe the different weights
on the internal edges (see Example 3) are responsible for that.
[Figure: computing times of tree metrics, n = 100, by type of metric]

Figure 2: Computing times of different metrics (logarithmic scale) for random
trees with $n = 100$ using the dual simplex algorithm. From left: $D_1$ but with
primal simplex algorithm, $D_1$, $D_1$ for $n = 200$, $\tilde D_1$, $D_2$, $\tilde D_2$, the geodesic and
the Robinson-Foulds metric.

Similar pictures are found for unweighted trees, see Figure 4. Interestingly,
$D_1 = \tilde D_1$ turns out to be integer-valued now, see the same figure. That
is somewhat surprising since the matrix corresponding to the linear program
P) is not totally unimodular in the sense of [18]: it contains the $3 \times 3$
submatrix

$$\begin{pmatrix} 1 & 1 & 0 \\ 1 & 0 & 1 \\ 0 & 1 & 1 \end{pmatrix}$$

with determinant $-2$.

Random caterpillars are interesting in their own right; the results are presented
in Figure 5. We observe that we obtain a much larger maximum of 28 for $D_1$
(over the sample) than from random trees. In comparison, the lower bound
from Lemma 10 would be much smaller: $\frac{n^2}{4} - n + 2 = 17$.

8 Discussion 

What have we achieved? We constructed at least two new biocomputable
metrics for comparing unrooted, but possibly weighted, phylogenetic trees.
We think this approach is valuable and could generalise well. One direction is
the extension to rooted trees. We should then just measure the distance of the
induced metrics on $X \cup \{\text{root}\}$. Another generalisation could be to phylogenetic
networks. Outside phylogeny, there should be applications to other kinds of
finite labelled metric spaces. At the moment, we are only aware of the papers
of F. Mémoli, e.g. [23], which deal with $\ell^p$-type Gromov-Hausdorff metrics.
In general, we follow [31] in arguing that there is no universal metric
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Figure 3: Comparison of different metrics for random trees with n = 10. 
Above from upper left: Di , D 2 , D^, Df D , D^ 0 , the geodesic and the 
Robinson-Foulds metric. Below, the distributions are presented in boxplots. 
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Figure 4: Comparison of different metrics for random unweighted trees with 
n = 10. Above from upper left: $D_1$, $D_2$, $D_\infty$, D^ D , , the geodesic and
the Robinson-Foulds metric. In the middle, the distributions are presented 
in boxplots. At the bottom, the frequency table of D\ is presented. 
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boxplots. At the bottom, the frequency table of Di is presented with the 
lower bound from Lemma [10] in red.
for phylogenetic trees which suits all purposes perfectly. We think that
every application requires its own choice, and we have added a further
option to this portfolio. Still, further properties of phylogenetic metrics
should be discussed to guide users. Monotony as considered in Section [6] is
a first, if trivial, step in this direction. Here we only want to discuss
some important results of the present paper and possible extensions.

It looks interesting to extend the metric to tree shapes by allowing the
labels to be permuted. That metric would still differ from the
Gromov-Hausdorff metric, since we only allow matchings of the labels, in
contrast to the weaker version in ([2]). For the Gromov-Hausdorff distance
it is shown in [25] that it is again NP-hard to compute. We expect the same
for the permutation approach.

One important topic, which already came up in [3j m m si], is the
question of how to weight the edges of the trees. We obtained natural weights
from our approach in Example [3]. If those weights do not fit the intention
of the user, it is easy to shorten or lengthen the edges of the trees and
obtain other metric spaces which can be compared just as easily. There is
also the possibility to weight the labels, for instance to account for uneven
sampling. We could adjust to this by weighting the $\|\cdot\|_p$ norms, which
again leads to similar computations. Note that we already met such a weighted
approach in the computations in Examples [3] and [H. Further, a Kantorovich-
Wasserstein approach similar to [23] might be feasible if the weights of the
leaves differ between the trees. In summary, our approach is natural but can
be well adjusted to the needs of applications.

We showed several properties of the new metrics, including compatibility
with the NNI-metric, a number of estimates against the pathwise difference
metrics, local properties related to the lower bound metrics $D_p^*$, and
monotony.
Of course, there are many more questions in this context. In particular, we
would like to sharpen the estimates. We do not know much about the
1-neighbourhoods on Tf(X), e.g. whether there are islands in the sense of
[3]. There are also many connections with the quartet, SPR, TBR, maximum
parsimony, weighted matching and BHV metrics to explore. Numerical
comparison was done for the R-implemented distances only.

We expect the maximal distance between two unweighted X-trees (the diameter)
to be realised by caterpillar trees. The simulation result in Figure [5]
points in this direction. A sharper estimate than those provided in Lemma [9]
and Lemma [10] would be quite interesting, too. It is still not clear whether
and why $D_1$ or $D_1^*$ takes integer values only on Tf(X).

The geometry induced by the Euclidean-type metrics $D_2$, $D_2^*$ should be
further explored, too. It would be interesting to prove that it is locally
Euclidean and to find out what the geodesics look like. Possibly, the
geodesic distance
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with respect to D 2 is even another metric. 

Most interesting, in our view, is the question whether D = D*. Provable
equality could at least save some computing time. Until this problem is
solved, we only know that there are new animals in the zoo of phylogenetic
distances ... but not how many.
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A On metric extensions 

Several times we met the problem whether a partial dissimilarity on $X$, i.e.
a map $q : E \to \mathbb{R}_{>0}$, $E \subseteq \binom{X}{2}$, has an extension
to a metric on $X$. This seems to be a well-known problem; one folklore
solution I found in dh

Theorem 5. If the graph $G = (X, E)$ is simple and connected, then
$q : E \to \mathbb{R}_{>0}$ extends to a semimetric on $X$ if and only if for
all $x, y \in X$ with $\{x, y\} \in E$, $q(x,y) = d_q^G(x,y)$.
The graph metric $d_q^G$ was introduced in (@J.
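
The criterion of Theorem 5 can be tested mechanically: compute the shortest-path (graph) metric induced by q and compare it with q on every edge. The following is a minimal sketch, assuming that the graph metric is the usual weighted shortest-path distance and that q is stored as a dictionary keyed by two-element frozensets. The function names and the data layout are our own choices, not the paper's.

```python
from itertools import product


def graph_metric(points, q):
    """All-pairs shortest-path lengths (Floyd-Warshall) for edge weights q.

    q maps frozenset({x, y}) to a weight; the graph is assumed to be
    simple and connected, as in Theorem 5.
    """
    inf = float("inf")
    d = {(x, y): (0.0 if x == y else q.get(frozenset((x, y)), inf))
         for x, y in product(points, repeat=2)}
    # product varies its first component slowest, so k is the outer loop.
    for k, i, j in product(points, repeat=3):
        if d[i, k] + d[k, j] < d[i, j]:
            d[i, j] = d[i, k] + d[k, j]
    return d


def extends_to_semimetric(points, q, tol=1e-12):
    """Theorem 5 criterion: q must agree with the graph metric on every edge."""
    d = graph_metric(points, q)
    for edge, weight in q.items():
        x, y = tuple(edge)
        if abs(d[x, y] - weight) > tol:
            return False
    return True


points = ["a", "b", "c"]
q_ok = {frozenset(("a", "b")): 1.0,
        frozenset(("b", "c")): 1.0,
        frozenset(("a", "c")): 2.0}
# Same triangle, but the edge {a, c} is longer than the detour via b.
q_bad = dict(q_ok)
q_bad[frozenset(("a", "c"))] = 3.0

print(extends_to_semimetric(points, q_ok))
print(extends_to_semimetric(points, q_bad))
```

When the test succeeds, the dictionary returned by `graph_metric` is itself one possible extension of q.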

Although this presents a complete solution of the extension problem, we
want to sharpen this criterion for improved applicability. The next result
should still be folklore, but I could not find it in the literature. If
$p = x_0 x_1 \dots x_m$ is a cycle in a graph $(X, E)$, we call any pair
$\{x_i, x_j\} \in E$ with $0 \le i, j \le m - 1$ and $2 \le |i - j| \le m - 2$
a chord of $p$. A cycle $p$ without a chord is called a minimal cycle.

Theorem 6. If the graph $G = (X, E)$ is simple and connected, then
$q : E \to \mathbb{R}_{>0}$ extends to a metric on $X$ if and only if for all
minimal cycles $p$ of $G$ and all edges $\{x, y\}$ in $p$

$$2q(x, y) \le \operatorname{len}(p). \qquad (12)$$

Proof. We assume the opposite. Thus we find a (non-minimal) cycle
$p = x_0 x_1 \dots x_m = x_0$ such that $e = \{x_0, x_1\}$ violates (12). We
may assume w.l.o.g. that the length $m$ of $p$ is minimal.

Non-minimality of $p$ implies that there is a chord $\{x_i, x_j\}$ of $p$,
$i < j$. Since $m$ is minimal, (12) holds for the two shorter cycles into
which the chord splits $p$, so we know

$$q(x_i, x_j) \le \sum_{k=i}^{j-1} q(x_k, x_{k+1})$$

and

$$q(x_0, x_1) \le \sum_{k=1}^{i-1} q(x_k, x_{k+1}) + q(x_i, x_j) + \sum_{k=j}^{m-1} q(x_k, x_{k+1}).$$

Substituting the first inequality into the RHS of the second one yields

$$q(x_0, x_1) \le \sum_{k=1}^{m-1} q(x_k, x_{k+1}).$$

This is (12) for the edge $e$ in $p$. This contradiction completes the
proof. □
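
As a small illustration of criterion (12) (our own example, not taken from the paper), consider a chordless 4-cycle in which one edge has weight a and the other three have weight 1. The cycle is then the only minimal cycle of the graph, so (12) only has to be checked there.

```latex
\begin{align*}
  p &= x_0 x_1 x_2 x_3 x_0, \qquad q(x_0,x_1) = a, \qquad
       q(x_1,x_2) = q(x_2,x_3) = q(x_3,x_0) = 1, \qquad
       \operatorname{len}(p) = a + 3, \\
  a = 4 &: \quad 2q(x_0,x_1) = 8 > 7 = \operatorname{len}(p),
       \text{ so (12) fails and $q$ has no metric extension}, \\
  a = 2 &: \quad 2q(x_0,x_1) = 4 < 5 = \operatorname{len}(p)
       \text{ and } 2 \cdot 1 < 5 \text{ for the other edges, so (12) holds}.
\end{align*}
```

In the case a = 2 one extension is the shortest-path metric, which assigns distance 2 to both diagonals {x_0, x_2} and {x_1, x_3}.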

We can use this result for the

Proof of Theorem 0. We apply Theorem [6] to $X \cup X'$ with
$E = \binom{X}{2} \cup \binom{X'}{2} \cup \{\{x, x'\} : x \in X\}$. The minimal
cycles in $(X \cup X', E)$ are either triangles in $X$, triangles in $X'$ or
rectangles $x, y, y', x'$. For the two former, (12) is equivalent to the
triangle inequalities for $\rho$, $\rho'$. For the latter, (12) is the same
as ([6]). □

The following result was used in the proof of Theorem [U 

Lemma 13. Suppose $X$, $Y$, $Z$ are disjoint sets and there are given
$d_1 \in M(X \cup Y)$ and $d_2 \in M(Y \cup Z)$ such that
$d_1|_Y = d_2|_Y$. Then there exists a $d \in M(X \cup Y \cup Z)$ such that
$d|_{X \cup Y} = d_1$ and $d|_{Y \cup Z} = d_2$.
Proof. We apply Theorem 6 to the graph
$(X \cup Y \cup Z, \binom{X \cup Y}{2} \cup \binom{Y \cup Z}{2})$ with

$$w(u,v) = \begin{cases} d_1(u,v), & u, v \in X \cup Y, \\ d_2(u,v), & u, v \in Y \cup Z. \end{cases}$$

Since both $X \cup Y$ and $Y \cup Z$ are cliques in this graph, the only
minimal cycles are triangles. For them (12) is fulfilled by definition of
$w$. □
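
One concrete extension in the situation of Lemma 13 is the shortest-path metric of the glued graph: since both parts are cliques carrying metrics, a shortest path from X to Z passes through exactly one point of Y. The following is a minimal sketch under that assumption. It also assumes that Y is non-empty (otherwise the glued graph is disconnected) and that metrics are stored as dictionaries on ordered pairs; the names `symmetric`, `glue_metrics` and `is_metric` are our own.

```python
def symmetric(pairs, points):
    """Build a full distance dictionary (both orders, zero diagonal)."""
    d = {(p, p): 0.0 for p in points}
    for (u, v), w in pairs.items():
        d[u, v] = d[v, u] = w
    return d


def glue_metrics(X, Y, Z, d1, d2):
    """Glue d1 on X ∪ Y and d2 on Y ∪ Z, which must agree on Y."""
    if not Y:
        raise ValueError("Y must be non-empty so that the glued graph is connected")
    XY, YZ = set(X) | set(Y), set(Y) | set(Z)
    d = {}
    for u in XY | YZ:
        for v in XY | YZ:
            if u in XY and v in XY:
                d[u, v] = d1[u, v]
            elif u in YZ and v in YZ:
                d[u, v] = d2[u, v]
            else:
                # One point lies in X, the other in Z: route through Y.
                x, z = (u, v) if u in XY else (v, u)
                d[u, v] = min(d1[x, y] + d2[y, z] for y in Y)
    return d


def is_metric(points, d):
    """Check zero diagonal, positivity, symmetry and the triangle inequality."""
    for u in points:
        for v in points:
            if d[u, v] != d[v, u]:
                return False
            if (u == v and d[u, v] != 0) or (u != v and d[u, v] <= 0):
                return False
            for w in points:
                if d[u, w] > d[u, v] + d[v, w]:
                    return False
    return True


X, Y, Z = ["x"], ["y1", "y2"], ["z"]
d1 = symmetric({("x", "y1"): 1.0, ("x", "y2"): 2.0, ("y1", "y2"): 2.0}, X + Y)
d2 = symmetric({("y1", "y2"): 2.0, ("y1", "z"): 3.0, ("y2", "z"): 1.0}, Y + Z)

d = glue_metrics(X, Y, Z, d1, d2)
print(d["x", "z"], is_metric(X + Y + Z, d))
```

Other extensions may exist; the shortest-path choice gives the largest possible value for each pair in X × Z.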
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